GEM-PRO - Calculating Protein Properties

This notebook gives an example of how to calculate protein properties for a list of proteins. The main features demonstrated are:

Information retrieval from UniProt and linking residue numbering sites to structure
Calculating or predicting global protein sequence and structure properties
Calculating or predicting local protein sequence and structure properties

**Input:** List of gene IDs

**Output:** Representative protein structures and properties associated with them

Imports



In [1]:

    
import sys
import logging



In [2]:

    
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO



In [3]:

    
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don't affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!



In [4]:

    
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #



In [5]:

    
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization

Set these three things:

ROOT_DIR
- The directory where a folder named after your PROJECT will be created
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files

**Note:** Methods for protein complexes and metabolites are still in development.



In [6]:

    
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']



In [7]:

    
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')









    



[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties: GEM-PRO project location
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2: number of genes

Mapping gene ID --> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

**Note:** You only need to map gene IDs using one service. However you can run both if some genes don't map in one service and do map in another!



In [8]:

    
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()









    



[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End






    





A Jupyter Widget






    









    



[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.






    



Missing UniProt mapping:  []






    Out[8]:







  
    
      
      uniprot
      reviewed
      gene_name
      kegg
      refseq
      pdbs
      pfam
      description
      entry_date
      entry_version
      seq_date
      seq_version
      sequence_file
      metadata_file
    
    
      gene
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      b0118
      P36683
      False
      acnB
      ecj:JW0114;eco:b0118
      NP_414660.1;WP_001307570.1
      1L5J
      PF00330;PF06434;PF11791
      Aconitate hydratase B
      2018-01-31
      165
      1997-11-01
      3
      P36683.fasta
      P36683.xml
    
    
      b1276
      P25516
      False
      acnA
      ecj:JW1268;eco:b1276
      NP_415792.1;WP_000099535.1
      NaN
      PF00330;PF00694
      Aconitate hydratase A
      2018-01-31
      153
      2008-01-15
      3
      P25516.fasta
      P25516.xml



In [9]:

    
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()









    





A Jupyter Widget






    









    



[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.






    



Missing a representative sequence:  []






    Out[9]:







  
    
      
      uniprot
      kegg
      pdbs
      sequence_file
      metadata_file
    
    
      gene
      
      
      
      
      
    
  
  
    
      b0118
      P36683
      ecj:JW0114;eco:b0118
      1L5J
      P36683.fasta
      P36683.xml
    
    
      b1276
      P25516
      ecj:JW1268;eco:b1276
      NaN
      P25516.fasta
      P25516.xml

Mapping representative sequence --> structure

These are the ways to map sequence to structure:

Use the UniProt ID and their automatic mappings to the PDB
BLAST the sequence to the PDB
Make homology models or
Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you'll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.



In [10]:

    
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()









    



[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End






    





A Jupyter Widget






    









    



[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.






    Out[10]:







  
    
      
      pdb_id
      pdb_chain_id
      uniprot
      experimental_method
      resolution
      coverage
      start
      end
      unp_start
      unp_end
      rank
    
    
      gene
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      b0118
      1l5j
      A
      P36683
      X-ray diffraction
      2.4
      1
      1
      865
      1
      865
      1
    
    
      b0118
      1l5j
      B
      P36683
      X-ray diffraction
      2.4
      1
      1
      865
      1
      865
      2



In [11]:

    
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.7, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)









    





A Jupyter Widget






    









    



[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
[2018-02-05 16:53] [ssbio.pipeline.gempro] WARNING: Empty dataframe






    Out[11]:

Homology models

Below, we are mapping to previously generated homology models for E. coli. If you are running this as a tutorial, they won't exist on your computer, so you can skip these steps.



In [15]:

    
import pandas as pd
import os.path as op



In [16]:

    
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                        'file_type':'pdb'}}
    
my_gempro.get_manual_homology_models(homology_model_dict)









    





A Jupyter Widget






    









    



[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.



In [17]:

    
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                         'file_type':'pdb'}}
    
my_gempro.get_manual_homology_models(homology_model_dict)









    





A Jupyter Widget






    









    



[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.

Downloading and ranking structures

**Warning:** Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.



In [18]:

    
# Download all mapped PDBs and gather the metadata
my_gempro.download_all_pdbs()
my_gempro.df_pdb_metadata.head(2)









    





A Jupyter Widget






    









    



[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Saved 1 structures total






    Out[18]:







  
    
      
      pdb_id
      pdb_title
      description
      experimental_method
      mapped_chains
      resolution
      chemicals
      taxonomy_name
      structure_file
    
    
      gene
      
      
      
      
      
      
      
      
      
    
  
  
    
      b0118
      1l5j
      CRYSTAL STRUCTURE OF E. COLI ACONITASE B.
      Aconitate hydratase 2 (E.C.4.2.1.3)
      X-ray diffraction
      A;B
      2.4
      F3S;TRA
      Escherichia coli
      1l5j.pdb



In [19]:

    
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()









    





A Jupyter Widget






    









    



[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.






    Out[19]:







  
    
      
      id
      is_experimental
      file_type
      structure_file
    
    
      gene
      
      
      
      
    
  
  
    
      b0118
      REP-1l5j
      True
      pdb
      1l5j-A_clean.pdb
    
    
      b1276
      REP-ACON1_ECOLI
      False
      pdb
      ACON1_ECOLI_model1_clean-X_clean.pdb

Computing and storing protein properties



In [20]:

    
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()









    





A Jupyter Widget



In [ ]:

    
# Requires SCRATCH installation, replace path_to_scratch with own path to script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='scratch', 
                                  results_dir=my_gempro.data_dir,
                                  num_cores=4)



In [22]:

    
my_gempro.find_disulfide_bridges(representatives_only=False)









    





A Jupyter Widget



In [23]:

    
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()









    





A Jupyter Widget



In [24]:

    
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()









    





A Jupyter Widget

Additional annotations

Loading feature files to the representative sequence

"Features" are currently loaded directly from UniProt, but if another feature file is available for each protein, it can be loaded manually.



In [25]:

    
# for g in my_gempro.genes_with_a_representative_sequence:
#     g.protein.representative_sequence.feature_path = '/path/to/new/feature/file.gff'

Adding more properties

Additional global or local properties can be loaded after loading the saved GEM-PRO.

Make sure to add 'seq_hydrophobicity-kd' to the list of columns to be returned later on!

Example with hydrophobicity



In [26]:

    
# Kyte-Doolittle scale for hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
       'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
       'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
       'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }



In [27]:

    
# Use Biopython to calculated hydrophobicity using a set sliding window length
from Bio.SeqUtils.ProtParam import ProteinAnalysis

window = 7

for g in my_gempro.genes_with_a_representative_sequence:
    # Create a ProteinAnalysis object -- see http://biopython.org/wiki/ProtParam
    my_seq = g.protein.representative_sequence.seq_str
    analysed_seq = ProteinAnalysis(my_seq)
    
    # Calculate scale
    hydrophobicity = analysed_seq.protein_scale(param_dict=kd, window=window)
    
    # Correct list length by prepending and appending "inf" (result needs to be same length as sequence)
    for i in range(window//2):
        hydrophobicity.insert(0, float("Inf"))
        hydrophobicity.append(float("Inf"))
        
    # Add new annotation to the representative sequence's "letter_annotations" dictionary
    g.protein.representative_sequence.letter_annotations['hydrophobicity-kd'] = hydrophobicity

Global protein properties

Properties of the entire protein sequence/structure are stored in:

The representative_sequence annotations field
The representative_structure's representative chain SeqRecord

These properties describe aspects of the entire protein, such as its molecular weight, the percentage of amino acids in a particular secondary structure, etc.



In [28]:

    
# Printing all global protein properties

from pprint import pprint

# Only looking at 2 genes for now, remove [:2] to gather properties for all
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    repseq = g.protein.representative_sequence
    repstruct = g.protein.representative_structure
    repchain = g.protein.representative_chain
    
    print('Gene: {}'.format(g.id))
    print('Number of structures: {}'.format(g.protein.num_structures))
    print('Representative sequence: {}'.format(repseq.id))
    print('Representative structure: {}'.format(repstruct.id))
    
    print('----------------------------------------------------------------')
    print('Global properties of the representative sequence:')
    pprint(repseq.annotations)
    
    print('----------------------------------------------------------------')
    print('Global properties of the representative structure:')
    pprint(repstruct.chains.get_by_id(repchain).seq_record.annotations)
    
    print('****************************************************************')
    print('****************************************************************')
    print('****************************************************************')









    



Gene: b1276
Number of structures: 3
Representative sequence: P25516
Representative structure: REP-ACON1_ECOLI
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.08641975308641975,
                              'C': 0.007856341189674524,
                              'D': 0.06397306397306397,
                              'E': 0.06172839506172839,
                              'F': 0.025813692480359147,
                              'G': 0.08754208754208755,
                              'H': 0.020202020202020204,
                              'I': 0.04826038159371493,
                              'K': 0.04826038159371493,
                              'L': 0.09427609427609428,
                              'M': 0.028058361391694726,
                              'N': 0.037037037037037035,
                              'P': 0.05611672278338945,
                              'Q': 0.030303030303030304,
                              'R': 0.05723905723905724,
                              'S': 0.05723905723905724,
                              'T': 0.06060606060606061,
                              'V': 0.0819304152637486,
                              'W': 0.014590347923681257,
                              'Y': 0.03254769921436588},
 'aromaticity-biop': 0.07295173961840629,
 'instability_index-biop': 36.28239057239071,
 'isoelectric_point-biop': 5.59344482421875,
 'molecular_weight-biop': 97676.06830000057,
 'monoisotopic-biop': False,
 'percent_acidic-pepstats': 0.1257,
 'percent_aliphatic-pepstats': 0.31089,
 'percent_aromatic-pepstats': 0.09315,
 'percent_basic-pepstats': 0.1257,
 'percent_charged-pepstats': 0.2514,
 'percent_helix_naive-biop': 0.29741863075196406,
 'percent_non-polar-pepstats': 0.56341,
 'percent_polar-pepstats': 0.43659,
 'percent_small-pepstats': 0.53872,
 'percent_strand_naive-biop': 0.27048260381593714,
 'percent_tiny-pepstats': 0.29966000000000004,
 'percent_turn_naive-biop': 0.2379349046015713}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.010101010101010102,
 'percent_C-dssp': 0.2222222222222222,
 'percent_E-dssp': 0.1739618406285073,
 'percent_G-dssp': 0.03928170594837262,
 'percent_H-dssp': 0.345679012345679,
 'percent_I-dssp': 0.005611672278338945,
 'percent_S-dssp': 0.09427609427609428,
 'percent_T-dssp': 0.10886644219977554}
****************************************************************
****************************************************************
****************************************************************
Gene: b0118
Number of structures: 4
Representative sequence: P36683
Representative structure: REP-1l5j
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.11213872832369942,
                              'C': 0.011560693641618497,
                              'D': 0.06358381502890173,
                              'E': 0.06589595375722543,
                              'F': 0.03352601156069364,
                              'G': 0.08786127167630058,
                              'H': 0.017341040462427744,
                              'I': 0.05433526011560694,
                              'K': 0.056647398843930635,
                              'L': 0.10173410404624278,
                              'M': 0.026589595375722544,
                              'N': 0.035838150289017344,
                              'P': 0.06242774566473988,
                              'Q': 0.028901734104046242,
                              'R': 0.04508670520231214,
                              'S': 0.04161849710982659,
                              'T': 0.05433526011560694,
                              'V': 0.06705202312138728,
                              'W': 0.009248554913294798,
                              'Y': 0.024277456647398842},
 'aromaticity-biop': 0.06705202312138728,
 'instability_index-biop': 32.79631213872841,
 'isoelectric_point-biop': 5.23931884765625,
 'molecular_weight-biop': 93497.01500000065,
 'monoisotopic-biop': False,
 'percent_acidic-pepstats': 0.12948,
 'percent_aliphatic-pepstats': 0.33526000000000006,
 'percent_aromatic-pepstats': 0.08439,
 'percent_basic-pepstats': 0.11907999999999999,
 'percent_charged-pepstats': 0.24855,
 'percent_helix_naive-biop': 0.29017341040462424,
 'percent_non-polar-pepstats': 0.59075,
 'percent_polar-pepstats': 0.40924999999999995,
 'percent_small-pepstats': 0.53642,
 'percent_strand_naive-biop': 0.3063583815028902,
 'percent_tiny-pepstats': 0.30751,
 'percent_turn_naive-biop': 0.22774566473988442}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.016241299303944315,
 'percent_C-dssp': 0.20765661252900233,
 'percent_E-dssp': 0.14037122969837587,
 'percent_G-dssp': 0.03480278422273782,
 'percent_H-dssp': 0.3805104408352668,
 'percent_I-dssp': 0.0,
 'percent_S-dssp': 0.08236658932714618,
 'percent_T-dssp': 0.13805104408352667}
****************************************************************
****************************************************************
****************************************************************

Local protein properties

Properties of specific residues are stored in:

The representative_sequence's letter_annotations attribute
The representative_structure's representative chain SeqRecord

Specific sites, like metal or metabolite binding sites, can be found in the representative_sequence's features attribute. This information is retrieved from UniProt. The below examples extract features for the metal binding sites.

The properties related to those sites can be retrieved using the function get_residue_annotations.

UniProt contains more information than just "sites"



In [29]:

    
# Looking at all features
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    g.id
    
    # UniProt features
    [x for x in g.protein.representative_sequence.features]
    
    # Catalytic site atlas features
    for s in g.protein.structures:
        if s.structure_file:
            for c in s.mapped_chains:
                if s.chains.get_by_id(c).seq_record:
                    if s.chains.get_by_id(c).seq_record.features:
                        [x for x in s.chains.get_by_id(c).seq_record.features]









    Out[29]:





'b1276'






    Out[29]:





[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1)), type='initiator methionine'),
 SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(891)), type='chain', id='PRO_0000076661'),
 SeqFeature(FeatureLocation(ExactPosition(434), ExactPosition(435)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(500), ExactPosition(501)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(503), ExactPosition(504)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(521), ExactPosition(522)), type='sequence conflict')]






    Out[29]:





'b0118'






    Out[29]:





[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(865)), type='chain', id='PRO_0000076675'),
 SeqFeature(FeatureLocation(ExactPosition(243), ExactPosition(246)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(413), ExactPosition(416)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='mutagenesis site'),
 SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(14)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(35)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(41), ExactPosition(51)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(58), ExactPosition(72)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(82), ExactPosition(90)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(98), ExactPosition(104)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(104), ExactPosition(107)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(108), ExactPosition(111)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(111), ExactPosition(120)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(127), ExactPosition(137)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(140), ExactPosition(151)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(153), ExactPosition(157)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(165), ExactPosition(178)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(178), ExactPosition(182)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(184), ExactPosition(190)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(193), ExactPosition(197)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(197), ExactPosition(200)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(213), ExactPosition(216)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(219), ExactPosition(227)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(232), ExactPosition(243)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(247), ExactPosition(257)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(257), ExactPosition(261)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(263), ExactPosition(269)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(271), ExactPosition(278)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(279), ExactPosition(288)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(291), ExactPosition(295)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(305), ExactPosition(310)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(310), ExactPosition(314)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(314), ExactPosition(318)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(318), ExactPosition(321)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(323), ExactPosition(327)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(333), ExactPosition(341)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(343), ExactPosition(360)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(383), ExactPosition(391)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(391), ExactPosition(394)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(408), ExactPosition(413)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(414), ExactPosition(417)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(417), ExactPosition(427)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(440)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(443), ExactPosition(448)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(450), ExactPosition(465)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(465), ExactPosition(468)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(478), ExactPosition(483)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(483), ExactPosition(486)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(491), ExactPosition(497)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(502), ExactPosition(507)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(511), ExactPosition(521)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(530), ExactPosition(538)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(545), ExactPosition(559)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(572), ExactPosition(576)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(576), ExactPosition(583)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(588), ExactPosition(597)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(597), ExactPosition(600)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(600), ExactPosition(603)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(604), ExactPosition(609)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(612), ExactPosition(632)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(637), ExactPosition(653)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(666), ExactPosition(673)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(673), ExactPosition(676)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(680), ExactPosition(683)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(690), ExactPosition(693)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(693), ExactPosition(696)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(696), ExactPosition(699)), type='turn'),
 SeqFeature(FeatureLocation(ExactPosition(703), ExactPosition(707)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(713), ExactPosition(726)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(731), ExactPosition(737)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(741), ExactPosition(750)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(752), ExactPosition(760)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(769), ExactPosition(772)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(774), ExactPosition(777)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(783), ExactPosition(790)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(798)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(801), ExactPosition(805)), type='strand'),
 SeqFeature(FeatureLocation(ExactPosition(807), ExactPosition(817)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(822), ExactPosition(834)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(836), ExactPosition(840)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(845), ExactPosition(848)), type='helix'),
 SeqFeature(FeatureLocation(ExactPosition(849), ExactPosition(857)), type='helix')]



In [30]:

    
metal_info = []

for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            res_info['gene_id'] = g.id
            res_info['seq_id'] = g.protein.representative_sequence.id
            res_info['struct_id'] = g.protein.representative_structure.id
            res_info['chain_id'] = g.protein.representative_chain
            metal_info.append(res_info)
        
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp', 
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']

pd.DataFrame.from_records(metal_info, columns=cols).set_index(['gene_id', 'seq_id', 'struct_id', 'chain_id', 'seq_resnum'])









    Out[30]:







  
    
      
      
      
      
      
      seq_residue
      struct_residue
      struct_resnum
      seq_SS-sspro
      seq_SS-sspro8
      seq_RSA-accpro
      seq_RSA-accpro20
      struct_SS-dssp
      struct_RSA-dssp
      struct_ASA-dssp
      struct_PHI-dssp
      struct_PSI-dssp
      struct_CA_DEPTH-msms
      struct_RES_DEPTH-msms
    
    
      gene_id
      seq_id
      struct_id
      chain_id
      seq_resnum
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      b1276
      P25516
      REP-ACON1_ECOLI
      X
      435
      C
      C
      435
      NaN
      NaN
      NaN
      NaN
      H
      0.059259
      8.0
      -61.1
      -26.6
      2.656722
      2.813536
    
    
      501
      C
      C
      501
      NaN
      NaN
      NaN
      NaN
      S
      0.088889
      12.0
      -61.0
      -50.0
      1.999713
      2.409119
    
    
      504
      C
      C
      504
      NaN
      NaN
      NaN
      NaN
      G
      0.259259
      35.0
      -56.0
      -45.6
      1.999634
      1.961484
    
    
      b0118
      P36683
      REP-1l5j
      A
      710
      C
      C
      710
      NaN
      NaN
      NaN
      NaN
      T
      0.118519
      16.0
      -67.1
      -7.2
      10.148960
      10.009109
    
    
      769
      C
      C
      769
      NaN
      NaN
      NaN
      NaN
      -
      0.088889
      12.0
      -67.8
      -28.3
      8.296585
      8.049832
    
    
      772
      C
      C
      772
      NaN
      NaN
      NaN
      NaN
      G
      0.081481
      11.0
      -50.2
      -38.0
      8.282292
      8.239369

Column definitions

General

gene_id: Gene ID used in GEM-PRO project
seq_id: Representative protein sequence ID
struct_id: Representative protein structure ID, with REP- prepended to it. 4 letter structure IDs are experimental structures from the PDB, others are homology models
chain_id: Representative chain ID in the representative structure

General -- residue specific

seq_resnum: Residue number of the amino acid in the representative sequence
site_name: Name of the feature as defined in UniProt
seq_residue: Amino acid in the representative sequence at the residue number
struct_residue: Amino acid in the representative structure at the residue number
struct_resnum: Residue number of the amino acid in the representative structure

Residue-level properties calculated or predicted from sequence:

seq_SS-sspro: Predicted secondary structure, 3 definitions (from the SCRATCH program)
seq_SS-sspro8: Predicted secondary structure, 8 definitions (SCRATCH)
seq_RSA-accpro: Predicted exposed (e) or buried (-) residue (SCRATCH)
seq_RSA-accpro20: Predicted exposed/buried, 0 to 100 scale (SCRATCH)

Residue-level properties calculated from structure:

struct_SS-dssp: Secondary structure (DSSP program)
struct_RSA-dssp: Relative solvent accessibility (DSSP)
struct_ASA-dssp: Solvent accessibility, absolute value (DSSP)
struct_PHI-dssp: Phi angle measure (DSSP)
struct_PSI-dssp: Psi angle measure (DSSP)
struct_RES_DEPTH-msms: Calculated residue depth averaged for all atoms in the residue (MSMS program)
struct_CA_DEPTH-msms: Calculated residue depth for the carbon alpha atom (MSMS)

Visualizing residues



In [31]:

    
for g in my_gempro.genes:
    
    # Gather residue numbers
    metal_binding_structure_residues = []
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            metal_binding_structure_residues.append(res_info['struct_resnum'])
    print(metal_binding_structure_residues)
    
    # Display structure
    view = g.protein.representative_structure.view_structure()
    g.protein.representative_structure.add_residues_highlight_to_nglview(view=view, structure_resnums=metal_binding_structure_residues)
    
    view









    



[435, 501, 504]






    



[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :X ) and not hydrogen and ( 504 or 435 or 501 )






    





A Jupyter Widget






    



[710, 769, 772]






    



[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 769 or 772 or 710 )






    





A Jupyter Widget

Comparing features in different structures of the same protein



In [32]:

    
# Run all sequence to structure alignments
for g in my_gempro.genes:
    for s in g.protein.structures:
        g.protein.align_seqprop_to_structprop(seqprop=g.protein.representative_sequence, structprop=s)



In [33]:

    
metal_info_compared = []

for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            for s in g.protein.structures:
                for c in s.mapped_chains:
                    res_info = g.protein.get_residue_annotations(seq_resnum=f.location.end, 
                                                                 seqprop=g.protein.representative_sequence,
                                                                 structprop=s, chain_id=c,
                                                                 use_representatives=False)
                    res_info['gene_id'] = g.id
                    res_info['seq_id'] = g.protein.representative_sequence.id
                    res_info['struct_id'] = s.id
                    res_info['chain_id'] = c
                    metal_info_compared.append(res_info)
    
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue','struct_resnum',
        'seq_SS-sspro','seq_SS-sspro8','seq_RSA-accpro','seq_RSA-accpro20',
        'struct_SS-dssp','struct_RSA-dssp', 'struct_ASA-dssp', 
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']

pd.DataFrame.from_records(metal_info_compared, columns=cols).sort_values(by=['seq_resnum','struct_id','chain_id']).set_index(['gene_id','seq_id','seq_resnum','seq_residue','struct_id'])









    Out[33]:







  
    
      
      
      
      
      
      chain_id
      struct_residue
      struct_resnum
      seq_SS-sspro
      seq_SS-sspro8
      seq_RSA-accpro
      seq_RSA-accpro20
      struct_SS-dssp
      struct_RSA-dssp
      struct_ASA-dssp
      struct_PHI-dssp
      struct_PSI-dssp
      struct_CA_DEPTH-msms
      struct_RES_DEPTH-msms
    
    
      gene_id
      seq_id
      seq_resnum
      seq_residue
      struct_id
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      b1276
      P25516
      435
      C
      ACON1_ECOLI
      X
      C
      435
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      E01201
      X
      C
      435
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      REP-ACON1_ECOLI
      X
      C
      435
      NaN
      NaN
      NaN
      NaN
      H
      0.059259
      8.0
      -61.1
      -26.6
      2.656722
      2.813536
    
    
      501
      C
      ACON1_ECOLI
      X
      C
      501
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      E01201
      X
      C
      501
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      REP-ACON1_ECOLI
      X
      C
      501
      NaN
      NaN
      NaN
      NaN
      S
      0.088889
      12.0
      -61.0
      -50.0
      1.999713
      2.409119
    
    
      504
      C
      ACON1_ECOLI
      X
      C
      504
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      E01201
      X
      C
      504
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      REP-ACON1_ECOLI
      X
      C
      504
      NaN
      NaN
      NaN
      NaN
      G
      0.259259
      35.0
      -56.0
      -45.6
      1.999634
      1.961484
    
    
      b0118
      P36683
      710
      C
      1l5j
      A
      C
      710
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1l5j
      B
      C
      710
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      ACON2_ECOLI
      X
      C
      710
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      E00113
      X
      C
      710
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      REP-1l5j
      A
      C
      710
      NaN
      NaN
      NaN
      NaN
      T
      0.118519
      16.0
      -67.1
      -7.2
      10.148960
      10.009109
    
    
      769
      C
      1l5j
      A
      C
      769
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1l5j
      B
      C
      769
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      ACON2_ECOLI
      X
      C
      769
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      E00113
      X
      C
      769
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      REP-1l5j
      A
      C
      769
      NaN
      NaN
      NaN
      NaN
      -
      0.088889
      12.0
      -67.8
      -28.3
      8.296585
      8.049832
    
    
      772
      C
      1l5j
      A
      C
      772
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1l5j
      B
      C
      772
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      ACON2_ECOLI
      X
      C
      772
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      E00113
      X
      C
      772
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      REP-1l5j
      A
      C
      772
      NaN
      NaN
      NaN
      NaN
      G
      0.081481
      11.0
      -50.2
      -38.0
      8.282292
      8.239369

	uniprot	reviewed	gene_name	kegg	refseq	pdbs	pfam	description	entry_date	entry_version	seq_date	seq_version	sequence_file	metadata_file
gene
b0118	P36683	False	acnB	ecj:JW0114;eco:b0118	NP_414660.1;WP_001307570.1	1L5J	PF00330;PF06434;PF11791	Aconitate hydratase B	2018-01-31	165	1997-11-01	3	P36683.fasta	P36683.xml
b1276	P25516	False	acnA	ecj:JW1268;eco:b1276	NP_415792.1;WP_000099535.1	NaN	PF00330;PF00694	Aconitate hydratase A	2018-01-31	153	2008-01-15	3	P25516.fasta	P25516.xml

	id	is_experimental	file_type	structure_file
gene
b0118	REP-1l5j	True	pdb	1l5j-A_clean.pdb
b1276	REP-ACON1_ECOLI	False	pdb	ACON1_ECOLI_model1_clean-X_clean.pdb

					seq_residue	struct_residue	struct_resnum	seq_SS-sspro	seq_SS-sspro8	seq_RSA-accpro	seq_RSA-accpro20	struct_SS-dssp	struct_RSA-dssp	struct_ASA-dssp	struct_PHI-dssp	struct_PSI-dssp	struct_CA_DEPTH-msms	struct_RES_DEPTH-msms
gene_id	seq_id	struct_id	chain_id	seq_resnum
b1276	P25516	REP-ACON1_ECOLI	X	435	C	C	435	NaN	NaN	NaN	NaN	H	0.059259	8.0	-61.1	-26.6	2.656722	2.813536
				501	C	C	501	NaN	NaN	NaN	NaN	S	0.088889	12.0	-61.0	-50.0	1.999713	2.409119
				504	C	C	504	NaN	NaN	NaN	NaN	G	0.259259	35.0	-56.0	-45.6	1.999634	1.961484
b0118	P36683	REP-1l5j	A	710	C	C	710	NaN	NaN	NaN	NaN	T	0.118519	16.0	-67.1	-7.2	10.148960	10.009109
				769	C	C	769	NaN	NaN	NaN	NaN	-	0.088889	12.0	-67.8	-28.3	8.296585	8.049832
				772	C	C	772	NaN	NaN	NaN	NaN	G	0.081481	11.0	-50.2	-38.0	8.282292	8.239369